Background & Context

Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Customers leaving the credit card service would mean a loss for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave and the reasons why, and improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

Objective

Explore and visualize the dataset.
Build a classification model to predict if the customer is going to churn or not
Optimize the model using appropriate techniques
Generate a set of insights and recommendations that will help the bank

Data Dictionary:

CLIENTNUM: Client number. Unique identifier for the customer holding the account
Attrition_Flag: Internal event (customer activity) variable - if the account is closed then 1 else 0
Customer_Age: Age in Years
Gender: Gender of the account holder
Dependent_count: Number of dependents
Education_Level: Educational Qualification of the account holder
Marital_Status: Marital Status of the account holder
Income_Category: Annual Income Category of the account holder
Card_Category: Type of Card
Months_on_book: Period of relationship with the bank
Total_Relationship_Count: Total no. of products held by the customer
Months_Inactive_12_mon: No. of months inactive in the last 12 months
Contacts_Count_12_mon: No. of Contacts in the last 12 months
Credit_Limit: Credit Limit on the Credit Card
Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
Total_Trans_Amt: Total Transaction Amount (Last 12 months)
Total_Trans_Ct: Total Transaction Count (Last 12 months)
Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

Overview of the dataset

Import necessary Libraries

View the first 5 and last 5 rows of the data set

Understand the shape of the dataset

Check the Datatypes and columns of the dataset

We do not see any missing values in the columns.
There are 10 int variables, 5 float variables and 6 object variables.
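The checks above can be sketched on a small stand-in table. The rows below are made up for illustration and are not the bank's data. The sketch also shows why a null check alone reports no missing values even when a column contains placeholder strings such as Unknown.

```python
import pandas as pd

# Toy stand-in for the real dataset; all values are illustrative only.
df = pd.DataFrame({
    "Customer_Age": [45, 49, 51, 40],
    "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0],
    "Gender": ["M", "F", "M", "F"],
    "Education_Level": ["Graduate", "Unknown", "High School", "Graduate"],
})

# Number of columns per dtype (e.g. int64, float64, and a text dtype).
dtype_counts = df.dtypes.astype(str).value_counts().to_dict()

# True NaN counts per column; placeholder strings are NOT counted here.
missing_per_column = df.isnull().sum()

# Placeholder strings have to be counted separately on the non-numeric columns.
unknown_per_column = (df.select_dtypes(exclude="number") == "Unknown").sum()

print(dtype_counts)
print(missing_per_column)
print(unknown_per_column)
```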

Summary of the dataset

CLIENTNUM: The client number is a unique identifier, adds no value to the analysis, and can be dropped.
Customer age ranges from 26 to 73.
Customers have been with the bank for at least 13 months (Months_on_book).
The maximum number of products held is 6.
Credit limit ranges from 1438 to 34516.
Avg_Open_To_Buy has a huge range, from 3 to 34516.

Most of the customers are existing customers who have not attrited yet.
The most frequent values are Female for gender, Graduate for education and Married for marital status.
The most frequent income category is less than $40K.
The card category that most customers hold is Blue.
Attrition_Flag is our target variable

Fixing the Datatypes

Observations on Customer_Age

Age is almost normally distributed.
We can see a couple of outliers where age > 70.

Observations on Months_on_book

The mean and median are both about 36 months.
We can see some outliers where months < 15 and months > 50.

Observations on Credit_Limit

Credit_Limit is right skewed.
Credit limits greater than about 25000 show up as outliers.

Observations on Total_Revolving_Bal

Total_Revolving_Bal is left skewed
Most of the customers had a revolving balance of 0

Observations on Avg_Open_To_Buy

The distribution is right skewed.
The amount left to use on the credit line shows outliers for amounts greater than about 23000.

Observations on Total_Amt_Chng_Q4_Q1

We can see outliers greater than 1.2 and less than 0.3. The distribution is almost normal.

Observations on Total_Trans_Amt

The mean of Total_Trans_Amt is about 4500, and we have outliers above 8500.

Observations on Total_Trans_Ct

Total_Trans_Ct has a median value of 67. We have a couple of outliers greater than 135.

Observations on Total_Ct_Chng_Q4_Q1

The IQR spans 0.55 to 0.8, and there are outliers above 1.2 and below 0.25.
The distribution is approximately normal.

Observations on Avg_Utilization_Ratio

The distribution is right skewed, with a median of about 0.18.

Observations on Attrition_Flag

16.1% of customers have attrited and 83.9% are existing customers.

Observations on Gender

52.9% are female and 47.1% are male.

Observations on Dependent_count

27% of customers have 3 dependents, 26.2% have 2 dependents and 18.1% have 1 dependent.

Observations on Education_Level

30.9% of customers are graduates, followed by 19.9% who have a high school education. 15% show as Unknown; we need to consider these as missing values and treat them later.

Observations on Marital_Status

46.3% of customers are married, 38.9% are single and 7.4% are divorced.
7.4% of customers have an unknown marital status; we will treat these as missing.

Observations on Income_Category

35.2% of customers make less than $40K, followed by 17.7% who make $40K-$60K.
Here too we have 11% unknowns, which need to be treated as missing values.

Observations on Card_Category

93.2% of customers have the Blue card. Only very few customers have the other card types.

Observations on Total_Relationship_Count

22.8% of customers hold 3 products with the bank. About 18% each hold 4, 5 and 6 products.

Observations on Months_Inactive_12_mon

38.0% of customers had their card inactive for 3 months, followed by 32.4% who had not used their card for 2 months.
1.2% have not used their card for 6 months.

Observations on Contacts_Count_12_mon

33.4% of the customers had 3 contacts with the bank (initiated by either side) in the past 12 months, and 31.9% had 2 contacts.

Bivariate Analysis

Bivariate Analysis of Attrition_Flag with categorical variables

More Female customers have attrited than Male customers

We can see slightly more attrition for customers who have 3 or 4 dependents. The attrition for customers with 0, 1, 2 or 5 dependents remains almost the same.

Attrition is higher for customers who have a Doctorate degree, followed by those with a Post-Graduate degree.
Customers in the Graduate, College, High School and Uneducated groups have about the same level of attrition.

We do not see much difference in attrition between Single, Married and Divorced customers.

Attrition is slightly higher for customers with income less than $40K, but this could also be because the majority of customers are in this income category. We would need to analyse this further.

The attrition rate is highest among customers using the Platinum credit card.
Customers using Gold, Blue and Silver cards have similar levels of attrition.

Attrition is higher for customers who hold 1 or 2 products than for customers who hold 4 or more products.

Interestingly, customers who were never inactive in the last 12 months show the highest attrition.
Customers who have been inactive for 3 or 4 months also show higher attrition.

Attrition has been higher when there is higher number of contacts with the Bank in the last 12 months.

Bivariate Analysis of Attrition_Flag with continuous variables

There is not much difference in age between attrited and existing customers.
Most of the attrited customers have more than 30 and less than 40 Months_on_book (period of relationship with the bank), but there is no significant difference in the period of relationship between existing and attrited customers.
Attrited customers have a lower Total_Revolving_Bal than existing customers.
There is not much difference in the available credit (Avg_Open_To_Buy) between existing and attrited customers.
Change in transaction amount from Q4 over Q1 (Total_Amt_Chng_Q4_Q1) is lower for attrited customers than for existing customers.
Total transaction amount in the last 12 months is lower for attrited customers than for existing customers.
There is a big difference in total transaction count: attrited customers have fewer transactions than existing customers.
Change in transaction count is lower for attrited customers than for existing customers.
Average utilization ratio is lower for attrited customers than for existing customers.

Attrition_Flag has a slight positive correlation with Customer_Age, Dependent_count and Months_on_book.
Attrition_Flag has a slight negative correlation with Total_Revolving_Bal, Total_Amt_Chng_Q4_Q1, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1 and Avg_Utilization_Ratio.
There is a strong positive correlation between Customer_Age and Months_on_book, which suggests that older customers tend to have a longer relationship with the bank.
Total_Revolving_Bal has a positive correlation with Avg_Utilization_Ratio.
Total_Trans_Amt and Total_Trans_Ct have a positive correlation.
Avg_Utilization_Ratio and Avg_Open_To_Buy have a negative correlation.

Pair Plot

We can see a positive correlation between Customer_Age and Months_on_book, and also between Avg_Open_To_Buy and Credit_Limit.

Insights Based On EDA

Based on our analysis, 16.1% of all customers have attrited and 83.9% are existing customers.
More female customers have attrited than male customers.
93.2% of customers hold the Blue card, but the attrition rate is highest among customers using the Platinum credit card.
We can see slightly more attrition for customers who have 3 or 4 dependents.
Attrition is higher for customers who have a Doctorate degree, followed by those with a Post-Graduate degree.
Attrition is slightly higher for customers with income less than $40K, but this could also be because the majority of customers are in this income category.
Customers who have been inactive for 3 or 4 months also show higher attrition.
Attrition has also been higher when there is a higher number of contacts with the bank in the last 12 months.

Outlier Detection

All outlier values look acceptable for the dataset and we do not see anything unusual, so we will not be treating them at this point.
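The outliers mentioned in the univariate observations are the points a box plot draws beyond its whiskers. A minimal sketch of that rule follows, assuming the usual 1.5 x IQR whisker length. The helper name iqr_outlier_mask and the sample ages are made up for illustration.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, whisker: float = 1.5) -> pd.Series:
    """Flag points outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - whisker * iqr) | (values > q3 + whisker * iqr)

# Illustrative ages only; not taken from the bank's data.
ages = pd.Series([26, 38, 41, 44, 46, 48, 52, 55, 73], name="Customer_Age")

mask = iqr_outlier_mask(ages)
outliers = ages[mask].tolist()
print(outliers)  # Q1 = 41, Q3 = 52, so the fences are 24.5 and 68.5 -> [73]
```

Flagging a point this way does not by itself mean it should be removed. Here the flagged values are plausible, which is why they are kept.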

Let us look at unique values of all the categories

We can see Unknown values in the columns Education_Level, Marital_Status and Income_Category, which can be treated as missing values.

Missing Value Detection & Treatment

Education_Level has 15% missing values out of the total observations.
Marital_Status has 7.4% missing values out of the total observations.
Income_Category has 10.98% missing values out of the total observations.
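One way to obtain such percentages is to turn the Unknown placeholder into NaN and take the mean of the null mask. The sketch below uses a five-row toy frame, so its percentages differ from the ones reported above.

```python
import pandas as pd

# Toy rows for illustration; the category labels mimic the dataset's style.
df = pd.DataFrame({
    "Education_Level": ["Graduate", "Unknown", "High School", "Graduate", "Doctorate"],
    "Marital_Status": ["Married", "Single", "Unknown", "Married", "Unknown"],
    "Income_Category": ["Less than $40K", "$40K - $60K", "Unknown", "$60K - $80K", "$120K +"],
})
cols = ["Education_Level", "Marital_Status", "Income_Category"]

# mask() replaces every cell where the condition is True with NaN.
df[cols] = df[cols].mask(df[cols].eq("Unknown"))

# Share of missing values per column, in percent.
missing_pct = (df[cols].isnull().mean() * 100).round(2)
print(missing_pct)
```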

Missing Value Treatment

We will use KNNImputer to impute missing values.
KNNImputer: Each sample's missing values are imputed using the n_neighbors nearest neighbors found in the training set. The default is n_neighbors=5.
KNNImputer replaces a missing value with the average of that feature over the k nearest neighbors that have it.
Nearest points are found using a Euclidean distance that ignores missing coordinates (nan_euclidean in scikit-learn).
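A tiny numeric sketch of this behaviour follows, with made-up numbers and n_neighbors=2 so the result can be checked by hand. The imputer is fitted on the training rows only and then applied to the test row, so the neighbors always come from the training set.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up training rows with two features and no missing values.
X_train = np.array([[1.0, 2.0], [2.0, 4.0], [2.5, 5.0], [8.0, 16.0]])
# Test row whose second feature is missing.
X_test = np.array([[2.2, np.nan]])

imputer = KNNImputer(n_neighbors=2)
imputer.fit(X_train)                      # neighbors are taken from the training set
X_test_imputed = imputer.transform(X_test)

# The two nearest training rows on the first feature are [2.0, 4.0] and [2.5, 5.0],
# so the missing value becomes the mean of 4.0 and 5.0.
print(X_test_imputed)  # [[2.2 4.5]]
```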

Values are encoded

Split Data

Impute Missing Values using KNN Imputer

All missing values have been treated.
Let's inverse map the encoded values.
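The encode, impute and inverse-map steps can be sketched end to end on a toy column. The integer codes in marital_map are an assumption made for illustration; the notebook's own mapping is not shown in the text. Because KNNImputer averages neighbor values, the imputed code is rounded before it is mapped back to a label.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical label-to-code mapping and its inverse.
marital_map = {"Married": 0, "Single": 1, "Divorced": 2}
inverse_marital_map = {code: label for label, code in marital_map.items()}

# Toy training rows; the last one has a missing Marital_Status.
train = pd.DataFrame({
    "Customer_Age": [30.0, 32.0, 50.0, 52.0, 31.0],
    "Marital_Status": ["Single", "Single", "Married", "Married", np.nan],
})

# Encode labels as numbers; NaN stays NaN so the imputer can find it.
train["Marital_Status"] = train["Marital_Status"].map(marital_map)

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# Round the averaged code to the nearest valid category, then map back to labels.
imputed["Marital_Status"] = (
    imputed["Marital_Status"].round().astype(int).map(inverse_marital_map)
)
print(imputed)
```

In this toy case the two nearest customers by age (30 and 32) are both Single, so the missing entry is filled with Single.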

Checking inverse mapped values/categories

Encoding categorical variables

After encoding there are 47 columns
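A minimal sketch of one-hot encoding with pandas follows, on a toy frame. Whether the notebook dropped the first level of each category is not stated, so drop_first=True here is an assumption. It is a common choice for logistic regression because it avoids perfectly collinear dummy columns.

```python
import pandas as pd

# Toy rows for illustration only.
X = pd.DataFrame({
    "Credit_Limit": [12691.0, 8256.0, 3418.0],
    "Gender": ["M", "F", "M"],
    "Card_Category": ["Blue", "Gold", "Blue"],
})

# Each categorical column becomes one 0/1 column per level, minus the first level.
X_encoded = pd.get_dummies(X, columns=["Gender", "Card_Category"], drop_first=True)
encoded_columns = list(X_encoded.columns)
print(encoded_columns)  # ['Credit_Limit', 'Gender_M', 'Card_Category_Gold']
```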

Building the model

Model evaluation criterion:

Model can make wrong predictions as:

Predicting that a customer will renounce the credit card services when the customer stays: loss of resources
Predicting that a customer will not renounce the credit card services when the customer leaves: loss of income

Which case is more important?

Predicting that a customer will not renounce the credit card when the customer does, i.e. losing a potential source of income, because that customer will not be targeted by the bank's retention efforts.

How do we reduce this loss, i.e. reduce false negatives?

The bank wants recall to be maximized: the greater the recall, the lower the chance of false negatives.
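To make the link between recall and false negatives concrete, here is a small sketch with made-up labels, where 1 means the customer attrited. Recall is TP / (TP + FN), so every missed attriting customer lowers it directly.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 4 customers really attrited (1), 6 stayed (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# The model misses one attriting customer (false negative)
# and wrongly flags one loyal customer (false positive).
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_pred)

print(tn, fp, fn, tp)   # 5 1 1 3
print(recall_manual)    # 0.75
```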

Logistic Regression

Let's evaluate the model performance by using KFold and cross_val_score

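The K-Folds cross-validator provides dataset indices to split data into train/validation sets. KFold splits the dataset into k consecutive folds (without shuffling by default). StratifiedKFold is the variation that returns stratified folds, so each fold keeps roughly the same class proportions as the full dataset, which matters for an imbalanced target like this one. Each fold is then used once as validation while the k - 1 remaining folds form the training set.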
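A sketch of cross-validated recall with stratified folds follows. It uses a synthetic imbalanced dataset from make_classification as a stand-in for the bank data. The 84/16 class split, the fold count and the random seeds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: roughly 16% positives, similar to the attrition rate.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.84, 0.16], random_state=1
)

# Five stratified folds; shuffling with a fixed seed keeps the run repeatable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# One recall value per validation fold.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, scoring="recall", cv=cv
)
print(scores)
print(scores.mean())
```

The spread of the per-fold scores is what statements such as "performance varies between a and b" refer to.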

Performance (recall) on the training set varies between 0.27 and 0.55.
Checking performance on test data

Logistic Regression has given a generalized performance on the training and test sets.
Recall is very low. We can try oversampling (adding synthetic minority-class samples to the training data) to see if the model performance can be improved.

Oversampling train data using SMOTE

Logistic Regression on oversampled data

Evaluate model performance using KFold and cross_val_score

Performance of the model on the training set varies between 0.75 and 0.88, which is an improvement over the previous model.
Let's check the performance on the test set.

Performance on the training set improved, but the model is not able to replicate it on the test set. The model is overfitting.
Let's try:

a) Regularization to see if overfitting can be reduced

b) Undersampling the train to handle the imbalance between classes and check the model performance.

Regularization

The model seems to perform well on test data, but there still seems to be some overfitting.
Let's try undersampling.

Undersampling train data

Logistic Regression on undersampled data

Evaluate model performance using KFold and cross_val_score

Performance of the model on the training set varies between 0.68 and 0.82, which is an improvement over the initial model (without oversampling).
Let's check the performance on the test set.

The model has given a generalized performance on the training and test sets.
Model performance has improved using undersampling: logistic regression is now able to differentiate well between the positive and negative classes.
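Random undersampling keeps every minority-class row and a random, equally sized subset of the majority class. The sketch below implements that idea directly with NumPy so it has no extra dependencies. It is an illustration, not the notebook's code, and the helper name random_undersample is made up. It should be applied to the training split only, so the test set keeps the real class balance.

```python
import numpy as np

def random_undersample(X, y, random_state=1):
    """Keep an equal number of rows per class, sized to the smallest class."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # For each class, draw n_keep row indices without replacement.
    keep_idx = np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=n_keep, replace=False)
        for cls in classes
    ])
    keep_idx.sort()  # preserve the original row order
    return X[keep_idx], y[keep_idx]

# Toy training labels: 84 existing (0) and 16 attrited (1) customers.
y = np.array([0] * 84 + [1] * 16)
X = np.arange(100).reshape(-1, 1)

X_under, y_under = random_undersample(X, y)
class_counts = np.bincount(y_under).tolist()
print(class_counts)  # [16, 16]
```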

Model Comparison - Logistic Regression

The logistic regression model on undersampled data has given a generalized performance with the highest recall on test data.

Finding the coefficients

Converting coefficients to odds

Conclusion

Customer_Age: For a one-unit increase in Customer_Age, we expect to see about an 11.5% increase in the odds of a customer attriting.
Total_Revolving_Bal: For a one-unit decrease in Total_Revolving_Bal, we expect to see about a 0.056% increase in the odds of a customer attriting.
Total_Trans_Amt: For a one-unit decrease in Total_Trans_Amt, we expect to see about a 0.045% increase in the odds of a customer attriting.
Total_Trans_Ct: For a one-unit decrease in Total_Trans_Ct, we expect to see about a 9.4% increase in the odds of a customer attriting.
Total_Relationship_Count_2: For customers who hold 2 products (the dummy variable Total_Relationship_Count_2 going from 0 to 1), we expect to see about a 7.4% increase in the odds of a customer attriting.
Months_Inactive_12_mon_3: For customers who were inactive for 3 months in the last 12 months (the dummy variable Months_Inactive_12_mon_3 going from 0 to 1), we expect to see about a 10.5% increase in the odds of a customer attriting.

Similarly we can calculate the odds of a customer attriting based on the coef values of other attributes
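The conversion behind these statements is that a logistic regression coefficient b is a change in log-odds, so a one-unit increase in the feature multiplies the odds by exp(b), a change of (exp(b) - 1) x 100 percent. A short sketch follows. The feature names and coefficient values are hypothetical and are not the fitted values from this model.

```python
import numpy as np

# Hypothetical coefficients for illustration only.
coefs = {"feature_a": 0.10, "feature_b": -0.09}

# Odds ratio for a one-unit increase in each feature.
odds_ratios = {name: float(np.exp(b)) for name, b in coefs.items()}

# Percentage change in the odds for a one-unit increase.
pct_change = {name: (ratio - 1.0) * 100.0 for name, ratio in odds_ratios.items()}

for name in coefs:
    print(f"{name}: odds ratio {odds_ratios[name]:.4f}, change {pct_change[name]:+.2f}%")
```

A negative coefficient gives an odds ratio below 1. For such a feature a one-unit decrease multiplies the odds by exp(-b), which is greater than 1.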

Model Building Bagging and Boosting

We can see that XGBoost is giving the highest cross validated recall followed by Gradient Boosting and AdaBoost
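A comparison like this can be produced by looping over candidate models and recording the mean cross-validated recall of each. The sketch below uses only scikit-learn estimators on synthetic data. XGBoost is left out to keep the example dependency-free, so the ranking it prints says nothing about the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the training data.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.84, 0.16], random_state=1
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Bagging": BaggingClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=1),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Mean recall across the validation folds for each model.
mean_recall = {
    name: cross_val_score(model, X, y, scoring="recall", cv=cv).mean()
    for name, model in models.items()
}

# Model names ordered from highest to lowest mean recall.
ranking = sorted(mean_recall, key=mean_recall.get, reverse=True)
for name in ranking:
    print(f"{name}: {mean_recall[name]:.3f}")
```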

Hyperparameter Tuning using Grid search & Random search for all models

We will use pipelines with StandardScaler and tune the models using GridSearchCV and RandomizedSearchCV. We will also compare the performance and time taken by these two methods: grid search and randomized search.

We can also use make_pipeline function instead of Pipeline to create a pipeline.
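A minimal sketch of this setup follows, on synthetic data with a deliberately small parameter grid so it runs quickly. With make_pipeline, each step is named after its lowercased class name, which is why the parameter keys start with decisiontreeclassifier__. The grid values and n_iter are assumptions chosen for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=1
)

# Scaling and the model are tuned together, so the scaler is refit inside each CV fold.
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))

param_grid = {
    "decisiontreeclassifier__max_depth": [2, 4, 6, 8],
    "decisiontreeclassifier__min_samples_leaf": [1, 5, 10],
}

# Grid search evaluates every combination (4 x 3 = 12 candidates).
start = time.perf_counter()
grid = GridSearchCV(pipe, param_grid, scoring="recall", cv=5).fit(X, y)
grid_seconds = time.perf_counter() - start

# Randomized search evaluates only n_iter sampled combinations.
start = time.perf_counter()
rand = RandomizedSearchCV(
    pipe, param_grid, n_iter=5, scoring="recall", cv=5, random_state=1
).fit(X, y)
rand_seconds = time.perf_counter() - start

print(grid.best_params_, round(grid.best_score_, 3), f"{grid_seconds:.2f}s")
print(rand.best_params_, round(rand.best_score_, 3), f"{rand_seconds:.2f}s")
```

Randomized search fits fewer candidates, which is where its time saving comes from. The trade-off is that it may miss the best combination in the grid.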

Creating two functions to calculate different metrics and confusion matrix, so that we don't have to use the same code repeatedly for each model.
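Two such helpers might look like the sketch below. The names get_metrics_score and get_confusion_matrix, their signatures, and the synthetic data are assumptions made for illustration. The notebook's own functions are not shown in the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

def get_metrics_score(model, X_train, X_test, y_train, y_test):
    """Return accuracy, recall and precision on both the train and test splits."""
    scores = {}
    for split, X_part, y_part in (("train", X_train, y_train), ("test", X_test, y_test)):
        pred = model.predict(X_part)
        scores[split] = {
            "accuracy": accuracy_score(y_part, pred),
            "recall": recall_score(y_part, pred),
            "precision": precision_score(y_part, pred, zero_division=0),
        }
    return scores

def get_confusion_matrix(model, X, y):
    """Return the 2x2 confusion matrix with rows = actual and columns = predicted."""
    return confusion_matrix(y, model.predict(X), labels=[0, 1])

# Usage on synthetic data.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

scores = get_metrics_score(model, X_train, X_test, y_train, y_test)
matrix = get_confusion_matrix(model, X_test, y_test)
print(scores)
print(matrix)
```

Comparing the train and test entries side by side is what supports the overfitting remarks made for each tuned model.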

AdaBoost

GridSearchCV

The test recall is similar to the cross-validated recall, and there is slight overfitting.

RandomizedSearchCV

The parameters from random search differ slightly from those from grid search, except for the base estimator.
Test recall is approximately the same for both methods, and the model overfits slightly more with the parameters given by random search.

Random Forest

GridSearchCV

The test recall has increased by 9% as compared to the result from cross validation with default parameters.
The model seems to be a bit overfitting

RandomizedSearchCV

The test recall has increased by 8% as compared to the result from cross validation with default parameters, but is less than the recall from grid search.
The model seems to be a bit overfitting.

XGBoost

GridSearchCV

The test recall has increased by ~7% as compared to the result from cross validation with default parameters.
The model is generalizing well on test and training data

RandomizedSearchCV

Grid search gives slightly better results than random search, and here too the model is slightly overfitting.

Decision Tree Classifier

GridSearchCV

The test recall has decreased as compared to the result from cross validation with default parameters.
The model is slightly overfit

RandomizedSearchCV

The test recall has decreased as compared to the result from cross validation with default parameters.
The model is slightly overfit

Bagging Classifier

GridSearchCV

The test recall has slightly increased compared to the result from cross validation with default parameters.
The model is slightly overfit

RandomizedSearchCV

The test recall has slightly increased compared to the result from cross validation with default parameters.
The model is slightly overfit

Comparing all models for Performance and Time Taken

The XGBoost model tuned using grid search gives the best test recall of 0.95, with comparable accuracy and precision. Based on time taken, however, the XGBoost model tuned with randomized search gives close performance in far less time.
Random Forest with grid search took the longest to complete, at around five and a half hours. The Decision Tree model with randomized search took the least time, at just 6.27 seconds.
Let's see the feature importance from the tuned XGBoost model.

Feature Importance - XGBoost

Total_Trans_Ct is the most important feature, followed by Total_Trans_Amt and Total_Revolving_Bal.

Actionable Insights & Business Recommendations

The bank should focus on customers who have a low Total_Trans_Ct in the last 12 months, as these customers have a high chance of attriting. Since these customers do not make many transactions, they are evidently not using their credit cards much. The bank can provide offers to get them using their credit cards.
Similarly, customers with a low total transaction amount in the past 12 months and those with a low revolving balance on their account have a higher chance of attrition. The bank must focus on such customers and provide offers to retain them.
Customers who hold 2 products with the bank also show higher attrition. The bank can encourage such customers to enroll in more products, which will help retain them.
Change in transaction amount from Q4 over Q1 (Total_Amt_Chng_Q4_Q1) is lower for attrited customers than for existing customers. Customers have to be encouraged to spend more, and offers can be provided for this.
Customers who have been inactive for a month show a high chance of attrition. The bank should focus on such customers as well.
Other groups to focus on are customers with income less than $40K and customers who have had a high number of contacts with the bank in the last 12 months.